Abstract: This paper is the first of two on the problem of estimating a function of a probability
distribution from a finite set of samples of that distribution. In this paper a Bayesian analysis of
this problem is presented, the optimal properties of the Bayes estimators are discussed, and as an
example of the formalism, closed form expressions for the Bayes estimators for the moments of the
Shannon entropy function are derived. Numerical results are presented that compare the Bayes
estimator to the frequency-counts estimator for the Shannon entropy.
1. INTRODUCTION

Consider a system with m possible states and an associated m-vector of probabilities of those
states, $p = (p_i)$, $1 \le i \le m$ ($\sum_{i=1}^{m} p_i = 1$). The system is repeatedly and
independently sampled according to the distribution p. Let the total number of samples be N and
denote the associated vector of counts of states by $n = (n_i)$, $1 \le i \le m$
($\sum_{i=1}^{m} n_i = N$). By definition, n is multinomially distributed.

In many cases what we are interested in is not p but some function of p, F(p). In these papers 
(this paper and Ref. 1) we are concerned with the problem of estimating some function F(p) from 
the data n. This problem is ubiquitous in physics, arising for example in dimension estimation and 
in estimating correlations from data. Some previous work on this issue (most closely related to the 
work of Ref. 2) appears in Refs. 3, 4, 13, and 14.

In Sec. 2 of this paper we introduce the Bayes estimator for F(p) given n. In Sec. 3 we discuss
the optimal properties of Bayes estimators and discuss their relation to conventional statistical
techniques. Section 4 contains the central mathematical results needed to calculate Bayes
estimators for F(p). We then apply these results to the case where F(p) is the Shannon entropy
$S(p) = -\sum_i p_i \log(p_i)$. Section 5a contains a brief calculation showing that for small
sample sizes there are significant differences between the Bayes and frequency-counts
($S(n) = -\sum_i (n_i/N) \log(n_i/N)$) estimators for the Shannon entropy. In Sec. 5b we present
graphs of the results of a numerical comparison of the Bayes and frequency-counts estimators for
the Shannon entropy.

We note in passing that the intuitive notion of Shannon entropy as the "amount of missing
information" is not usually considered meaningful if the information at hand consists of data n
rather than the underlying distribution p, since Shannon entropy is a function of p rather than of
n. In the sense that the Bayes estimator discussed in this paper is optimal, and produces a Shannon
entropy value from information of the form n, the Bayes estimator can be viewed as a way of
defining the "amount of missing information" when the information at hand consists of a finite data
set n rather than a full distribution p.

In the second paper in this series (Ref. 1), Bayes estimators for several other functions F(p)
besides the Shannon entropy are presented. In a paper in preparation (Ref. 2) we will present the
results of a classical sampling distribution analysis of the problem of estimating F(p) from n for
several F(p) of interest.



2. BAYESIAN ESTIMATION OF F(p) FROM COUNTS 

To estimate F(p) from the data n, it is necessary to find the probability density function (pdf)
P(p | n). First note that $P(n \mid p) = N! \prod_{i=1}^{m} [p_i^{n_i} / n_i!]$. By Bayes' theorem
the pdf P(p | n) is given by

    $P(p \mid n) = P(n \mid p)\, P(p) / P(n)$    (1)

where $P(n) = \int dp\, P(n \mid p)\, P(p)$, and where P(p) has support only on the simplex
$R = \{p : p_i \ge 0\ \forall i,\ \sum_i p_i = 1\}$. P(p | n) is called the "posterior pdf",
P(n | p) is called the "likelihood", and P(p) is called the "prior pdf". Unless otherwise stated,
integrals over p are understood to be definite integrals over the region extending from 0 to
$\infty$ in each $p_i$. Note that because of cancellation, the constant
$N! \prod_{i=1}^{m} (n_i!)^{-1}$ does not appear in P(p | n); we can simply write
$P(p \mid n) \propto P(p) \prod_{i=1}^{m} p_i^{n_i}$, with the proportionality constant (dependent
on n only) set by normalization.

The pdf of F(p) given n is given in terms of P(p | n) by

    $P(F(p) = f \mid n) = \int dp\, \delta(F(p) - f)\, P(p \mid n)$    (2)

It is important to note that if what we know is n, and if what we wish to know is f, then it is the
distribution in Eq. (2), and this one alone, which tells us what we want. Distributions which do not
depend on a prior (e.g., likelihood-based quantities) do not share this property. (Of course, they
also don't share the property of distributions like P(F(p) = f | n) that, as usually calculated, such
distributions depend on an assumption for the prior. See Ref. 1.)

Rather than trying to find the density P(F(p) = f | n) directly, it is simpler to find the moments
of F(p) given this density. The k-th moment of F(p) given n is given by
$\int df\, f^k P(F(p) = f \mid n) = \int dp\, F^k(p)\, P(p \mid n)$, i.e. the k-th moment of F(p)
given n is the posterior average of $F^k(p)$ according to the posterior distribution P(p | n).
Define $F_k$ by

    $F_k = \int dp\, F^k(p)\, P(p) \prod_{i=1}^{m} p_i^{n_i}$.    (3)

Using our formula for P(p | n) we see that the k-th moment of F(p) given n is given by $F_k / F_0$.
We refer to this ratio as the "Bayes estimator with prior P(p) for $F^k(p)$".

As an aside, note that the moments are useful things to know even when the distribution in
question is non-gaussian, so that knowing such moments doesn't directly give us things like the
full-width-half-maximum of the distribution. In particular, the Bayes estimators for $F^k(p)$ and
$F^{2k}(p)$ can be used to find the standard deviation of $F^k(p)$. This in turn may be used with
Chebyshev's inequality to bound the probability of deviation of $F^k(p)$ from the Bayes estimator
for $F^k(p)$, even if the posterior is non-gaussian.

To proceed further it is necessary to make an assumption for the prior pdf P(p); once this is
done, P(p | n) and $F_k / F_0$ are uniquely determined. In the calculations to follow, P(p) will be
assumed to be a uniform prior, i.e. it will be assumed to have the form
$P(p) \propto \Delta(p)\, \Theta(p)$, where $\Theta(p) = \prod_i \theta(p_i)$ ($\theta$ is the
theta function: $\theta(x) = 1$ for $x \ge 0$, 0 otherwise), $\Delta(p) = \delta(\sum_i p_i - 1)$,
and the proportionality constant is set by the normalization condition $\int dp\, P(p) = 1$.

We emphasize that here we are using the uniform prior only for reasons of expository simplicity.
In many problems the uniform prior is inappropriate and a different prior should be used. As
an example of such a non-uniform prior, the entropic prior $P(p) \propto e^{\alpha S(p)}$, where S
is the Shannon entropy and $\alpha$ is some constant, is related to the technique of Maximum
Entropy (Ref. 8). (Also see footnote 1.) As another example, the Dirichlet prior,
$P(p) \propto \prod_{i=1}^{m} p_i^{\alpha}$ for some constant $\alpha$, has also been considered in
some contexts (Ref. 9). It is also sometimes appropriate to assign to some states i a prior which
does not allow $p_i$ to differ from zero.

In Ref. 1 we consider the extension of our results to a broader class of priors than those consid- 
ered in this paper. In particular, both entropic and Dirichlet priors are discussed there. As a general 
rule, because there is no reason to believe that a Bayesian technique is optimal if the prior it uses 
is poorly chosen, we admonish the reader to choose each prior with careful attention to the problem 
at hand. 

For simplicity of presentation define

    $I[F(p), n] \equiv \int dp\, F(p)\, \Delta(p)\, \Theta(p) \prod_{i=1}^{m} p_i^{n_i}$.    (4)

Note that $I[\cdot, \cdot]$ is a functional of its first argument and a function of its second
argument. With this notation the Bayes estimator with uniform prior for $F^k(p)$ (i.e.
$F_k / F_0$ with P(p) uniform) is given by $I[F^k(p), n] / I[1, n]$. (For non-uniform P(p),
$F_k / F_0$ is given by a different ratio of integrals.)
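
The ratio $I[F^k(p), n] / I[1, n]$ can also be approximated by sampling, which is a convenient check on any closed form. The sketch below is an illustration rather than part of the derivation: it assumes integer counts, uses the fact that the posterior under the uniform prior is a Dirichlet density with parameters $n_i + 1$, and the function names (`sample_posterior`, `mc_bayes_estimate`, `shannon_entropy`) are hypothetical.

```python
import math
import random

def sample_posterior(counts, rng):
    """Draw one p from the posterior under the uniform prior, i.e. Dirichlet(n_i + 1)."""
    # A Dirichlet draw is a vector of independent Gamma(n_i + 1, 1) draws, normalized.
    g = [rng.gammavariate(n + 1, 1.0) for n in counts]
    total = sum(g)
    return [x / total for x in g]

def mc_bayes_estimate(F, counts, k=1, draws=20000, seed=0):
    """Monte Carlo approximation to I[F^k(p), n] / I[1, n] (hypothetical helper)."""
    rng = random.Random(seed)
    return sum(F(sample_posterior(counts, rng)) ** k for _ in range(draws)) / draws

def shannon_entropy(p):
    """S(p) = -sum_i p_i log(p_i) in nats, with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

if __name__ == "__main__":
    counts = [1, 4]
    print(mc_bayes_estimate(lambda p: p[0], counts))    # posterior mean of p_1
    print(mc_bayes_estimate(shannon_entropy, counts))   # posterior mean of S(p)
```

The sampling error shrinks only as the inverse square root of the number of draws, so this is a sanity check and not a substitute for the exact expressions of Sec. 4.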



3. BAYES ESTIMATORS MINIMIZE MEAN-SQUARED ERROR 

Before evaluating the integrals $I[F^k(p), n]$ we briefly discuss an optimality property of Bayes
estimators and relate these estimators to some classical estimation techniques.

If the true probabilities are fixed to p, then the mean-squared error when using an estimator
G(n) to estimate F(p) is given by

    $\sum_n P(n \mid p) \times (G(n) - F(p))^2$    (5)

For a given p, (5) is minimized by choosing G(n) independent of n: G(n) = F(p). More
generally, when p is distributed according to P(p), the mean-squared error is given by

    $\int dp\, P(p) \sum_n P(n \mid p) \times (G(n) - F(p))^2$    (6)

As is conventional, find the $G(\cdot)$ minimizing this expression by writing
$G(\cdot) = G_0(\cdot) + \alpha\, \eta(\cdot)$, differentiating (6) with respect to $\alpha$, and
evaluating the result at $\alpha = 0$. Doing this yields

    $\sum_n \eta(n) \int dp\, P(n \mid p)\, P(p) \times (G_0(n) - F(p)) = 0$.    (7)

Since this equality must hold for all $\eta(\cdot)$, for all n

    $\int dp\, P(n \mid p)\, P(p) \times (G_0(n) - F(p)) = 0$.    (8)

Eqn. 8 is solved (assuming $\int dp\, P(n \mid p)\, P(p) \ne 0$) by

    $G_0(n) = \int dp\, P(n \mid p)\, P(p)\, F(p) \,/ \int dp\, P(n \mid p)\, P(p) = F_1 / F_0$.    (9)

Note that Eqn. 9 holds for any prior P(p). Given the discussion in Sec. 2, Eqn. 9 shows that
$G_0(n)$, the estimator having minimal mean-squared error from F(p), is identical to the Bayes
estimator for F(p): $G_0(n) = \int dp\, P(p \mid n)\, F(p)$.

As an example consider the famous Laplace Sample Size Correction estimator, in which the
underlying $p_i$ are estimated from counts n by $\hat{p}_i = (n_i + 1) / (N + m)$. This estimator
is precisely the Bayes estimator with uniform prior for F(p) = p (see results in Ref. 1). Note that
for small $n_i$ the Bayes estimator differs markedly from the frequency-counts estimator
$\hat{p}_i = n_i / N$.
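
To make the optimality statement concrete, the following sketch evaluates expression (6) exactly for the simplest case, m = 2 and F(p) = p_1, under the uniform prior. It is an illustration under stated assumptions: it relies on two standard facts (under the uniform prior each value of n_1 has probability 1/(N + 1), and the posterior of p_1 is a Beta density with parameters n_1 + 1 and n_2 + 1), and all function names are hypothetical.

```python
from fractions import Fraction

def laplace_estimate(counts):
    """Laplace estimator (n_i + 1) / (N + m) for each state."""
    n_total, m = sum(counts), len(counts)
    return [Fraction(n + 1, n_total + m) for n in counts]

def frequency_estimate(counts):
    """Frequency-counts estimator n_i / N for each state."""
    n_total = sum(counts)
    return [Fraction(n, n_total) for n in counts]

def prior_averaged_mse(estimator, n_total):
    """Expression (6) for m = 2, F(p) = p_1 and a uniform prior, evaluated exactly."""
    total = Fraction(0)
    for k in range(n_total + 1):
        a, b = k + 1, n_total - k + 1            # Beta posterior parameters
        post_mean = Fraction(a, a + b)
        post_var = Fraction(a * b, (a + b) ** 2 * (a + b + 1))
        guess = estimator([k, n_total - k])[0]
        # Posterior expected squared error = posterior variance + squared offset from the mean.
        total += (post_var + (guess - post_mean) ** 2) / (n_total + 1)
    return total

if __name__ == "__main__":
    print(prior_averaged_mse(laplace_estimate, 5))     # 1/42
    print(prior_averaged_mse(frequency_estimate, 5))   # 1/30
```

For N = 5 the Laplace estimator attains 1/42 and the frequency-counts estimator 1/30, in line with Eqn. 9.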

We note as an aside that when $F(\cdot)$ is highly nonlinear and not one-to-one (e.g., when
$F(\cdot)$ is the Shannon entropy), one cannot evaluate the Bayes estimate for F(p) by calculating
F of the Bayes estimate for p, i.e., by calculating $F((n_i + 1) / (N + m))$. For these kinds of
$F(\cdot)$ one must take into account the probabilities of all p's to evaluate the Bayes estimator
for F(p). The set of the Bayes estimates for the individual $p_i$ simply does not contain
sufficient information to give the Bayes estimate for F(p). (Never mind enough information to
calculate error bars for that estimate.)
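
A small numerical illustration of this point, for m = 2 with counts n = (0, 2): the posterior mean of the entropy is compared with the entropy of the posterior-mean probabilities. This is a sketch only; it uses plain midpoint-rule quadrature over the unnormalized posterior $p^{n_1} (1 - p)^{n_2}$, and the function names are hypothetical.

```python
import math

def entropy2(p):
    """Two-state Shannon entropy S(p, 1 - p) in nats."""
    return -sum(x * math.log(x) for x in (p, 1.0 - p) if x > 0.0)

def posterior_mean_entropy(n1, n2, grid=20000):
    """Posterior mean of S under a uniform prior, by midpoint-rule quadrature."""
    num = den = 0.0
    for j in range(grid):
        p = (j + 0.5) / grid
        w = p ** n1 * (1.0 - p) ** n2    # unnormalized posterior density
        num += entropy2(p) * w
        den += w
    return num / den

def entropy_of_posterior_mean(n1, n2):
    """Entropy evaluated at the Laplace estimate (n_i + 1) / (N + 2)."""
    return entropy2((n1 + 1) / (n1 + n2 + 2))

if __name__ == "__main__":
    print(posterior_mean_entropy(0, 2))      # about 0.458
    print(entropy_of_posterior_mean(0, 2))   # about 0.562
```

Because the entropy is concave, the entropy of the posterior mean can never fall below the posterior mean of the entropy, so the two quantities differ whenever the posterior has nonzero width.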

In general, one might not want to take the mean f according to the pdf P(F(p) = f | n) to form
an estimate for F(p). For example, one might be interested in minimizing (the average of)
$|G(n) - F(p)|$ rather than $[G(n) - F(p)]^2$, a goal which generically results in an estimate of
the median of the pdf rather than its mean. As another example, in maximum-likelihood estimation,
for the case where F(p) = p, one estimates F(p) as the p that maximizes the likelihood P(n | p),
rather than as a mean or a median. The maximum-likelihood estimate corresponds to finding the mode
of P(F(p) = f | n) (assuming the prior over f is uniform). (When F(p) = p the result is the
frequency-counts estimate, $\hat{p}_i = n_i / N$.) As yet another example, it might be of interest
to minimize something other than a functional of the error G(n) - F(p). An instance of this appears
in footnote 2, which discusses minimizing the mean-squared bias to find what might be called a
"Bayes minimum-bias estimator". Finally, note that the classical sampling distribution problem
(arising in hypothesis testing) of finding the distribution of counts n given a known value f of
F(p), i.e. of finding P(n | F(p) = f), may be handled using the techniques developed in this paper
for calculating the posterior P(F(p) = f | n). We will discuss this issue in a later paper.



4. CALCULATION OF THE BAYES ESTIMATOR FOR SHANNON ENTROPY. 

As was shown in Sec. 2, finding the Bayes estimator with uniform prior for $F^k(p)$ reduces to
evaluating integrals of the form $I[F^k(p), n]$. This section contains the central techniques for
calculating these integrals and uses them to calculate the Bayes estimator for the Shannon entropy.

Readers interested only in the Shannon entropy results may skip directly to Sec. 4e. In Sec. 4a
we derive an important result that allows integrals like $I[\cdot, \cdot]$ to be recast as Laplace
convolution products. In Sec. 4b we outline the general procedure, based on the results of Sec. 4a,
for calculating the moments of F(p). In the remaining subsections we apply the procedure of Sec. 4b
to the case in which F(p) is the Shannon entropy. Section 4c contains a calculation of
$F_0 = I[1, n]$. In Sec. 4d we present a calculation for those integrals which, along with $F_0$,
give the Bayes estimator of the Shannon entropy.

4a. CONVOLUTION FORM OF THE INTEGRALS 

In this subsection two important results are given. First, in Thm. 1 it is shown that if a function
H(p) factors as $H(p) = \prod_{i=1}^{m} h_i(p_i)$, then the general form of the integral
$\int dp\, H(p)\, \Delta(p)\, \Theta(p)$ is that of a convolution product of m terms (recall that m
is the number of possible outcomes of the process under observation). Second, Laplace's convolution
theorem is given.

Define the Laplace convolution operator $\otimes$ by
$(f \otimes g)(\tau) = \int_0^{\tau} dx\, f(x)\, g(\tau - x)$.

Theorem 1. If $H(p) = \prod_{i=1}^{m} h_i(p_i)$ then
$\int dp\, \Delta(p)\, \Theta(p)\, H(p) = (\otimes_{i=1}^{m} h_i)(\tau) |_{\tau = 1}$.

Proof: The $p_i$ may not be independently integrated since the constraint
$\sum_{i=1}^{m} p_i = 1$ exists. This constraint is reflected in the explicit definition of the
integral,

    $\int dp\, \Delta(p)\, \Theta(p)\, H(p)
      = \int_0^{\infty} dp_1 \int_0^{\infty} dp_2 \cdots \{ h_1(p_1) \times \cdots \times h_m(p_m) \}
        \times \delta(1 - \sum_{i=1}^{m} p_i)$

    $ = \int_0^{1} dp_1\, h_1(p_1) \int_0^{1 - p_1} dp_2\, h_2(p_2) \cdots
        \int_0^{1 - (p_1 + \cdots + p_{m-2})} dp_{m-1}\, h_{m-1}(p_{m-1})\,
        h_m(1 - (p_1 + \cdots + p_{m-1}))$.

Define the m variables $\tau_k$, k = 1, ..., m, recursively by $\tau_1 = \sum_{i=1}^{m} p_i$ and
$\tau_k = \tau_{k-1} - p_{k-1}$. Since $\tau_k = \sum_{i=k}^{m} p_i = 1 - \sum_{i=1}^{k-1} p_i$, our
integral may be rewritten as

    $\int dp\, \Delta(p)\, \Theta(p)\, H(p)
      = \int_0^{\tau_1} dp_1\, h_1(p_1) \int_0^{\tau_2} dp_2\, h_2(p_2) \cdots
        \int_0^{\tau_{m-1}} dp_{m-1}\, h_{m-1}(p_{m-1})\, h_m(\tau_{m-1} - p_{m-1}) \,|_{\tau_1 = 1}$.

Now, with the definition of the convolution the integral can be rewritten as

    $\int dp\, \Delta(p)\, \Theta(p)\, H(p)
      = \int_0^{\tau_1} dp_1\, h_1(p_1) \cdots \int_0^{\tau_{m-2}} dp_{m-2}\, h_{m-2}(p_{m-2})\,
        (h_{m-1} \otimes h_m)(\tau_{m-2} - p_{m-2}) \,|_{\tau_1 = 1}$.

Since the convolution operator is both commutative and associative, we can repeat this procedure
and write the integral above with obvious notation as

    $\int dp\, \Delta(p)\, \Theta(p)\, H(p) = (\otimes_{i=1}^{m} h_i)(\tau) |_{\tau = 1}$.

QED.




Theorem 2 is the Laplace convolution theorem and is stated for completeness only. The proof
may be found in Ref. 10. Define the Laplace transform operator L by
$L[h](s) = \int_0^{\infty} h(t)\, e^{-st}\, dt$.

Theorem 2. If $L[h_i]$ exists for i = 1, ..., m, then
$L[\otimes_{i=1}^{m} h_i] = \prod_{i=1}^{m} L[h_i]$.

4b. OUTLINE OF GENERAL PROCEDURE 

Theorems 1 and 2 allow the calculation of integrals I[F(p), n] for functions of the form
$F(p) = \sum_j \prod_{i=1}^{m} h_{ij}(p_i)$, which we call "factorable". Here we briefly summarize
the procedure to be used.

i) For each $h_{ij}(p_i)$, calculate the Laplace transform of $h_{ij}(p_i) \times p_i^{n_i}$.

ii) Calculate $\sum_j \prod_{i=1}^{m} L[h_{ij}(p_i) \times p_i^{n_i}]$.

iii) Take the inverse Laplace transform of the term calculated in (ii) and, following Thm. 1,
evaluate it at $\tau = 1$.

As an example, let $F(p) = S(p) = -\sum_{i=1}^{m} p_i \log(p_i)$. All powers of S(p) are factorable
terms. Therefore, the procedure outlined above may be used to find the Bayes estimators with uniform
prior for any power of S(p), as shown in detail in the remainder of this section.

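4c. CALCULATION OF I[1, n]

In the next theorem the Laplace transform is used in concert with Thms. 1 and 2 to calculate
the normalization constant $F_0 = I[1, n]$. Defining the Gamma function $\Gamma(z)$ by
$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$ for Re(z) > 0, we have

Theorem 3. If $\mathrm{Re}(n_i) > -1\ \forall\, i = 1, \ldots, m$, then

    $I[1, n] = \prod_{i=1}^{m} \Gamma(n_i + 1) \,/\, \Gamma(N + m)$.

Proof: For the integral $I[1, n] = \int dp\, \Delta(p)\, \Theta(p) \prod_{i=1}^{m} p_i^{n_i}$, the
$h_i(p_i)$ of Thm. 1 are given by $h_i(p_i) = p_i^{n_i}$. Since

    $L[p^n](s) = \Gamma(n + 1) / s^{n+1}$ for n > -1,

and since $L^{-1}[s^{-(N+m)}](\tau) = \tau^{N+m-1} / \Gamma(N + m)$, we have by Thms. 1 and 2

    $I[1, n] = L^{-1}[\prod_{i=1}^{m} L[p^{n_i}](s)](\tau) |_{\tau = 1}
             = L^{-1}[\prod_{i=1}^{m} \Gamma(n_i + 1) / s^{n_i + 1}](\tau) |_{\tau = 1}
             = \prod_{i=1}^{m} \Gamma(n_i + 1) \,/\, \Gamma(N + m)$.

QED.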
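
Theorem 3 is easy to check numerically for m = 2, where the constrained integral reduces to $\int_0^1 p^{n_1} (1 - p)^{n_2}\, dp$. The sketch below compares the closed form with midpoint-rule quadrature; the function names are hypothetical and the quadrature is only meant as a sanity check.

```python
import math

def norm_const(counts):
    """Closed form of Thm. 3: prod_i Gamma(n_i + 1) / Gamma(N + m)."""
    log_val = sum(math.lgamma(n + 1) for n in counts)
    log_val -= math.lgamma(sum(counts) + len(counts))
    return math.exp(log_val)

def norm_const_quadrature(n1, n2, grid=20000):
    """Direct midpoint-rule evaluation of I[1, n] for m = 2."""
    h = 1.0 / grid
    total = 0.0
    for j in range(grid):
        p = (j + 0.5) * h
        total += p ** n1 * (1.0 - p) ** n2
    return total * h

if __name__ == "__main__":
    print(norm_const([1, 4]))            # Gamma(2) Gamma(5) / Gamma(7) = 1/30
    print(norm_const_quadrature(1, 4))
```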



4d. CALCULATION OF $I[p_1^{q_1} \log^{r_1}(p_1) \cdots p_m^{q_m} \log^{r_m}(p_m), n]$

As mentioned in Sec. 4b, since $S(p) = -\sum_i p_i \log(p_i)$, powers of S(p) are sums of terms each
of which has the form $p_1^{q_1} \log^{r_1}(p_1) \cdots p_m^{q_m} \log^{r_m}(p_m)$. Thus, to find the
Bayes estimators for arbitrary powers of the Shannon entropy, expressions of the form

    $I[p_1^{q_1} \log^{r_1}(p_1) \cdots p_m^{q_m} \log^{r_m}(p_m), n]$

must be calculated. Using the fact that $\partial_n^r p^n = p^n \log^r(p)$, we immediately have

Theorem 4. For $\mathrm{Re}(n_i) > -1\ \forall i$,

    $I[\log^{r_1}(p_1) \cdots \log^{r_m}(p_m), n]
       = \partial_{n_1}^{r_1} \cdots \partial_{n_m}^{r_m} I[1, n]$.

The justification for the needed interchange of derivative and integral is given in App. C of Ref. 1.

Make the definitions $\Phi^{(n)}(z) = \Psi^{(n-1)}(z)$ and
$\Delta\Phi^{(n)}(z_1, z_2) = \Phi^{(n)}(z_1) - \Phi^{(n)}(z_2)$, where $\Psi^{(n)}(z)$ is the
polygamma function $\Psi^{(n)}(z) = \partial_z^{n+1} \log(\Gamma(z))$. This definition of $\Phi$ is
made to facilitate the clean presentation of results;
$\Phi^{(n)}(z) = \partial_z^{n} \log(\Gamma(z))$.

Thms. 5 and 6 apply Thm. 4 to the calculation of the integral
$I[\log^{r_1}(p_1) \cdots \log^{r_m}(p_m), n]$ for some special cases.

Theorem 5. For $\mathrm{Re}(n_i) > -1\ \forall i$,

    $I[\log(p_u), n] = \{\prod_i \Gamma(n_i + 1)\} / \Gamma(N + m)
       \times \Delta\Phi^{(1)}(n_u + 1, N + m)$.

In using Thm. 4, note that since $N = \sum_{i=1}^{m} n_i$ we have $\partial_{n_i} N = 1$.

Proof:

    $I[\log(p_u), n] = \partial_{n_u} I[1, n]$    (by Thm. 4)

Substituting the result from Thm. 3 for I[1, n] above we find

    $ = \partial_{n_u} [ \prod_i \Gamma(n_i + 1) / \Gamma(N + m) ]$

    $ = \prod_{i \ne u} \Gamma(n_i + 1)\ \partial_{n_u} [ \Gamma(n_u + 1) / \Gamma(N + m) ]$

    $ = \prod_{i \ne u} \Gamma(n_i + 1) \times [ \Gamma(n_u + 1) / \Gamma(N + m) ]
        \times \Delta\Phi^{(1)}(n_u + 1, N + m)$    (by Def. of $\Phi$)

    $ = \{\prod_i \Gamma(n_i + 1)\} / \Gamma(N + m) \times \Delta\Phi^{(1)}(n_u + 1, N + m)$.    QED.
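
As a sanity check on Thm. 5, the sketch below evaluates both sides for m = 2 and integer counts. It assumes integer arguments so that the digamma difference $\Delta\Phi^{(1)}(a, b) = \psi(a) - \psi(b)$ can be written with harmonic numbers ($\psi(k) = H_{k-1} - \gamma$, so Euler's constant cancels); all function names are hypothetical.

```python
import math
from fractions import Fraction

def harmonic(k):
    """k-th harmonic number H_k as an exact fraction (H_0 = 0)."""
    return sum((Fraction(1, j) for j in range(1, k + 1)), Fraction(0))

def delta_phi1(a, b):
    """Delta Phi^(1)(a, b) = psi(a) - psi(b) for positive integers a and b."""
    return harmonic(a - 1) - harmonic(b - 1)

def log_integral_closed(counts, u):
    """Right-hand side of Thm. 5 for integer counts."""
    n_total, m = sum(counts), len(counts)
    gammas = math.prod(math.factorial(n) for n in counts)   # Gamma(n_i + 1) = n_i!
    prefactor = Fraction(gammas, math.factorial(n_total + m - 1))
    return prefactor * delta_phi1(counts[u] + 1, n_total + m)

def log_integral_quadrature(n1, n2, grid=20000):
    """Midpoint-rule value of the integral of log(p) p^n1 (1 - p)^n2 over [0, 1]."""
    h = 1.0 / grid
    total = 0.0
    for j in range(grid):
        p = (j + 0.5) * h
        total += math.log(p) * p ** n1 * (1.0 - p) ** n2
    return total * h

if __name__ == "__main__":
    print(log_integral_closed([1, 4], 0))     # -29/600
    print(log_integral_quadrature(1, 4))
```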



Theorem 6. For $\mathrm{Re}(n_i) > -1\ \forall i$,

i) $I[\log(p_u) \log(p_v), n] = \{\prod_i \Gamma(n_i + 1)\} / \Gamma(N + m)
      \times \{ \Delta\Phi^{(1)}(n_u + 1, N + m)\, \Delta\Phi^{(1)}(n_v + 1, N + m)
      - \Phi^{(2)}(N + m) \}$,  $u \ne v$.

ii) $I[\log^2(p_u), n] = \{\prod_i \Gamma(n_i + 1)\} / \Gamma(N + m)
      \times \{ \Delta\Phi^{(1)}(n_u + 1, N + m)^2 + \Delta\Phi^{(2)}(n_u + 1, N + m) \}$.

Proof: Similar to proof of Thm. 5.

4e. THE BAYES ESTIMATORS FOR MOMENTS OF THE SHANNON ENTROPY. 

In this subsection the results for the Bayes estimators with uniform prior for the first two powers
of S(p) are given, i.e. $S_1 / S_0$ and $S_2 / S_0$. Refer to Secs. 4a-d for the calculations used
here, and to Sec. 4d for the definitions of the functions $\Phi^{(n)}(z)$ and
$\Delta\Phi^{(n)}(z_1, z_2)$. Below, $e_i$ denotes the m-vector with a 1 in position i and zeros
elsewhere.

Theorem 7. For $\mathrm{Re}(n_i) > -1\ \forall i$,

    $S_1 / S_0 = -\sum_i \frac{n_i + 1}{N + m}\, \Delta\Phi^{(1)}(n_i + 2, N + m + 1)$.

Proof:

    $S_1 / S_0 = I[-\sum_i p_i \log(p_i), n] / I[1, n]$

    $ = -\sum_i I[\log(p_i), n + e_i] / I[1, n]$    (by Def. of $I[\cdot, \cdot]$)

    $ = -\sum_i \partial_{n_i} I[1, n + e_i] / I[1, n]$    (by Thm. 4)

    $ = -\sum_i \frac{n_i + 1}{N + m}\, \Delta\Phi^{(1)}(n_i + 2, N + m + 1)$    (by Thms. 5, 3).

QED.



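Theorem 8. For $\mathrm{Re}(n_i) > -1\ \forall i$,

    $S_2 / S_0 = \sum_{i \ne j} \frac{(n_i + 1)(n_j + 1)}{(N + m)(N + m + 1)}
        \{ \Delta\Phi^{(1)}(n_i + 2, N + m + 2)\, \Delta\Phi^{(1)}(n_j + 2, N + m + 2)
           - \Phi^{(2)}(N + m + 2) \}$

    $ + \sum_i \frac{(n_i + 1)(n_i + 2)}{(N + m)(N + m + 1)}
        \{ \Delta\Phi^{(1)}(n_i + 3, N + m + 2)^2 + \Delta\Phi^{(2)}(n_i + 3, N + m + 2) \}$.

Proof: Similar to proof of Thm. 7, using Thm. 6 with n replaced by $n + e_i + e_j$ for the
cross terms and by $n + 2 e_i$ for the diagonal terms.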
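
The first two moments give a posterior standard deviation for the entropy, which is what the Chebyshev argument of Sec. 2 needs. The following sketch evaluates Thms. 7 and 8 for integer counts. It is an illustration under stated assumptions: for positive integers it uses $\psi(a) - \psi(b) = H_{a-1} - H_{b-1}$ and $\psi'(k) = \pi^2/6 - \sum_{j=1}^{k-1} j^{-2}$, and the function names are hypothetical.

```python
import math

def digamma_diff(a, b):
    """Delta Phi^(1)(a, b) = psi(a) - psi(b) for positive integers."""
    return sum(1.0 / j for j in range(1, a)) - sum(1.0 / j for j in range(1, b))

def trigamma(k):
    """Phi^(2)(k) = psi'(k) for a positive integer k."""
    return math.pi ** 2 / 6.0 - sum(1.0 / j ** 2 for j in range(1, k))

def entropy_first_moment(counts):
    """S_1 / S_0 from Thm. 7."""
    n_total, m = sum(counts), len(counts)
    return -sum((n + 1) / (n_total + m) * digamma_diff(n + 2, n_total + m + 1)
                for n in counts)

def entropy_second_moment(counts):
    """S_2 / S_0 from Thm. 8."""
    n_total, m = sum(counts), len(counts)
    top = n_total + m + 2
    norm = (n_total + m) * (n_total + m + 1)
    total = 0.0
    for i, ni in enumerate(counts):
        # Diagonal term: Delta Phi^(1)^2 + Delta Phi^(2) at (n_i + 3, N + m + 2).
        diag = digamma_diff(ni + 3, top) ** 2 + trigamma(ni + 3) - trigamma(top)
        total += (ni + 1) * (ni + 2) / norm * diag
        for j, nj in enumerate(counts):
            if i != j:
                cross = digamma_diff(ni + 2, top) * digamma_diff(nj + 2, top) - trigamma(top)
                total += (ni + 1) * (nj + 1) / norm * cross
    return total

def entropy_posterior_std(counts):
    """Posterior standard deviation of S(p) given the counts."""
    mean = entropy_first_moment(counts)
    return math.sqrt(entropy_second_moment(counts) - mean ** 2)

if __name__ == "__main__":
    print(entropy_first_moment([0, 0]), entropy_posterior_std([0, 0]))
```

With no data and m = 2 the second moment reduces to $\int_0^1 S(p)^2\, dp = 5/6 - \pi^2/18$, which provides an independent check of the implementation.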

In a manner similar to the calculation of $S_1$ and $S_2$, all higher moments of S(p) are calculable
via differentiation (since $\partial_z \Phi^{(n)}(z) = \Phi^{(n+1)}(z)$). Note that when no data
have been observed, i.e. n = 0, the estimator $S_1 / S_0$ is simply
$-\Delta\Phi^{(1)}(2, m + 1) = \sum_{i=2}^{m} i^{-1}$. It should also be noted that as
$N \to \infty$ the Bayes estimator $S_1 / S_0 \to -\sum_i (n_i/N) \log(n_i/N)$, i.e. it
asymptotically becomes the frequency-counts estimator (Ref. 11).



5. BAYES ESTIMATORS VS. FREQUENCY-COUNTS ESTIMATORS. 



In this section we compare the Bayes estimator (see Thm. 7) and the frequency-counts estima- 
tor for the entropy in two ways. First, in Sec. 5a an explicit calculation of the two estimators is 
made for two specific cases when a small number of counts are observed in two bins (m = 2). This 
simple calculation points out that for small N there are significant differences in the values of the 
two estimators. Second, in Sec. 5b the two estimators are graphically compared for a range of
sample sizes and true underlying distributions.

5a. SMALL N 

For small N, the Bayes estimate $S_1 / S_0$ can differ considerably from the estimate one would
make using the frequency-counts estimator, $S(n) = -\sum_i (n_i/N) \log(n_i/N)$. This is
illustrated by the following pair of examples:

Example 1: Assume two possible events (m = 2). Let $n_1 = 0$ and $n_2 = 2$. Then
$S_1 / S_0 = 0.458$. The entropy estimate obtained using the frequency-counts estimator is 0.

Example 2: Again let m = 2. Assume that $n_1 = 1$ and $n_2 = 4$. Then $S_1 / S_0 = 0.533$. The
entropy estimate obtained using the frequency-counts estimator is 0.500.
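
The two examples can be reproduced with a few lines of code. The sketch below evaluates Thm. 7 for integer counts, using $\psi(n_i + 2) - \psi(N + m + 1) = H_{n_i + 1} - H_{N + m}$ in terms of harmonic numbers, and compares it with the frequency-counts estimator. Natural logarithms are assumed, and the function names are hypothetical.

```python
import math
from fractions import Fraction

def harmonic(k):
    """k-th harmonic number H_k as an exact fraction (H_0 = 0)."""
    return sum((Fraction(1, j) for j in range(1, k + 1)), Fraction(0))

def bayes_entropy(counts):
    """Bayes estimator S_1 / S_0 of Thm. 7 (uniform prior, integer counts), in nats."""
    n_total, m = sum(counts), len(counts)
    h_top = harmonic(n_total + m)
    return float(sum(Fraction(n + 1, n_total + m) * (h_top - harmonic(n + 1))
                     for n in counts))

def frequency_entropy(counts):
    """Frequency-counts estimator -sum_i (n_i/N) log(n_i/N), in nats."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)

if __name__ == "__main__":
    for counts in ([0, 2], [1, 4]):
        print(counts, round(bayes_entropy(counts), 3), round(frequency_entropy(counts), 3))
```

With no observations, `bayes_entropy([0] * m)` returns $\sum_{i=2}^{m} 1/i$, in agreement with the zero-data limit noted at the end of Sec. 4e.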

Note that there are "edge effects" in using $S_1 / S_0$ as the estimate for the entropy. If the
true p is uniform ($p_i = m^{-1}\ \forall i$), then S(p) is maximal and always exceeds
$S_1 / S_0$, no matter what the observed n are. This is because the estimate $S_1 / S_0$ takes into
account all possible p which might have generated the observed n, including all those with a smaller
entropy than the true (maximal) entropy. In a similar fashion, if the true S(p) is minimal then it
is always exceeded by $S_1 / S_0$, regardless of the value of the observed n.

5b. GRAPHICAL RESULTS OF NUMERICAL COMPARISONS. 

The graphs appearing in Figs. 1-5 depict several comparisons of the Bayes and frequency-counts
estimators for entropy. In all cases the solid line represents the Bayes estimator, the dash-dot
line represents the frequency-counts estimator, and the dotted line represents the true value of
the entropy, where applicable. Figure 6 depicts the pdf of the Bayes estimator for a fixed ratio of
counts as the number of counts increases. The graphs are the result of exact numerical
computations of the various quantities represented.

Figure 1 explicitly demonstrates the result of Sec. 3 of this paper for the Shannon entropy with
m = 2. Recall that Sec. 3 shows that the Bayes estimator is the minimal mean-squared error
estimator. As is immediately seen in Fig. 1, for all N the Bayes estimator has a smaller
mean-squared error than the frequency-counts estimator, where the mean-squared error for an
estimator S(n) is given by

    $\int dp\, P(p) \sum_n P(n \mid p)\, \{S(n) - S(p)\}^2$.    (10)

The curves were generated with P(p) uniform. The Bayes estimator is that of Thm. 7, which assumes
this uniform P(p).

Figure 2 depicts the average over p of the sample variance, that is

    $\int dp\, P(p) \sum_n P(n \mid p)\, \{S(n) - \sum_{n'} P(n' \mid p)\, S(n')\}^2$    (11)

This figure shows how, for a particular sample size N, the estimators deviate from their sample
averages. It is immediately seen that the Bayes estimator has a smaller sample variance. (This is in
agreement with the conservative "edge effects" behavior of the Bayes estimator which was
mentioned in the preceding subsection.) This result is useful for understanding Figs. 3 and 4.

Figure 3 shows the sample averages of the estimators as functions of the sample size N for
various values of the true p, that is

    $\sum_n P(n \mid p)\, S(n)$.    (12)

Figure 4 shows the same sample averages of the estimators, but now as functions of the true p
for various values of the sample size N.

It is of interest to note that the sample average of the frequency-counts estimator actually comes
closest to the true entropy for a range of p values and sufficiently large N (see Figs. 3d-f and
4d-f). To see how this is possible in light of the fact that the Bayes estimator has lower
mean-squared error, first note that

    $\int dp\, P(p) \sum_n P(n \mid p)\, \{S(n) - S(p)\}^2 =$    (13)

    $\int dp\, P(p) \sum_n P(n \mid p)\, \{S(n) - \sum_{n'} S(n')\, P(n' \mid p)\}^2
       + \int dp\, P(p)\, \{\sum_n P(n \mid p)\, S(n) - S(p)\}^2$,

i.e. the mean-squared error is the sum of the mean sample variance and the mean-squared bias. The
left hand side of Eqn. 13 is depicted in Fig. 1. The first integral on the right hand side is
depicted in Fig. 2. The integrand of the last integral on the right hand side (excluding P(p))
appears in Fig. 3 as the square of the difference between the curve for the estimator and the true
value being estimated. This quantity favors the frequency-counts estimator for some values of p for
sufficiently large N; however the first integral on the right more than compensates to give a
result favoring the Bayes estimator.
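
The decomposition in Eqn. 13 already holds pointwise in p, before averaging over the prior, and can be checked directly for m = 2. The sketch below computes the mean-squared error, sample variance, and squared bias of either estimator at a fixed true p = (p, 1 - p); it assumes integer counts and natural logarithms, and the function names are hypothetical.

```python
import math

def harmonic(k):
    """k-th harmonic number H_k (H_0 = 0)."""
    return sum(1.0 / j for j in range(1, k + 1))

def bayes_entropy(counts):
    """Bayes estimator of Thm. 7 for integer counts, in nats."""
    n_total, m = sum(counts), len(counts)
    return sum((n + 1) / (n_total + m) * (harmonic(n_total + m) - harmonic(n + 1))
               for n in counts)

def frequency_entropy(counts):
    """Frequency-counts estimator, in nats."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)

def decompose(estimator, p, n_total):
    """Return (mse, variance, squared bias) at a fixed true distribution (p, 1 - p)."""
    true_s = -sum(x * math.log(x) for x in (p, 1.0 - p) if x > 0.0)
    # Binomial sampling distribution P(n | p) for two states.
    pmf = [math.comb(n_total, k) * p ** k * (1.0 - p) ** (n_total - k)
           for k in range(n_total + 1)]
    values = [estimator([k, n_total - k]) for k in range(n_total + 1)]
    mean = sum(w * v for w, v in zip(pmf, values))
    variance = sum(w * (v - mean) ** 2 for w, v in zip(pmf, values))
    mse = sum(w * (v - true_s) ** 2 for w, v in zip(pmf, values))
    return mse, variance, (mean - true_s) ** 2

if __name__ == "__main__":
    for est in (bayes_entropy, frequency_entropy):
        print(est.__name__, decompose(est, 0.3, 10))
```

Integrating the three returned quantities over p with the uniform density gives the three integrals of Eqn. 13.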

Figure 5 depicts the sample averages of the estimators' squared deviations from the true value as
a function of p for various sample sizes N,

    $\sum_n P(n \mid p)\, \{S(n) - S(p)\}^2$.    (14)

The integral of the expression in (14) multiplied by the density P(p) (here uniform), depicted for
various N, is shown in Fig. 1.

Finally, Fig. 6 shows the convergence of the pdf P(s | n) given by

    $P(S(p) = s \mid n) = \int dp\, \delta(S(p) - s)\, P(p \mid n)$    (15)

for a fixed ratio (1 : 15) of observed counts $n_1 : n_2$, as the overall number of counts
$N = n_1 + n_2$ increases. Note the increasing density placed upon the true entropy as the counts N
increase. Note that the average of s according to this density P(s | n), i.e.
$\int ds\, s\, P(s \mid n)$, is the Bayes estimator for S(p) given the observations n. As mentioned
previously, of all estimators, its squared error averaged over both p and n is minimal.
FOOTNOTES

1 If one wishes to estimate p by finding the mode of P(p | n), then the entropic prior leads
immediately to the technique of MaxEnt in the case where the data are not a finite vector of counts
n but rather are some expectation value, $b = \sum_i p_i B(p_i)$. This follows since
$P(\sum_i p_i B(p_i) = b \mid p)$, considered as a function of p with b fixed, is everywhere either
1 or 0. As a result, by Bayes' theorem, for a prior of the form $e^{\alpha S(p)}$, finding the mode
of $P(p \mid \sum_i p_i B(p_i) = b)$ is equivalent to maximizing S(p) subject to
$\sum_i p_i B(p_i) = b$.

2 Note that what we have shown here is that $F_1 / F_0$ has the least mean-squared error from the
true F(p), on average. This does not imply that it is the estimator of the entropy which is least
biased on average. To find the least average bias estimator, one searches for the $G(\cdot)$
minimizing

    $\int dp\, P(p)\, [\sum_n [P(n \mid p) \times G(n)] - F(p)]^2$.

It is more complicated to find the $G(\cdot)$ minimizing this expression than it is to find the
$G(\cdot)$ minimizing the expression in Sec. 3. Setting up the problem analogously to Sec. 3, we get

    $0 = \int dp\, P(p)\, [\sum_n [P(n \mid p)\, G(n)] - F(p)] \times [\sum_n P(n \mid p) \times \eta(n)]$.

As usual, for this to hold for all $\eta(\cdot)$ means that it must hold for $\eta(\cdot)$ a
Kronecker delta function centered about any particular pattern n. Let k be any such pattern. We have

    $0 = \int dp\, P(p)\, P(k \mid p)\, [\sum_n P(n \mid p)\, G(n) - F(p)]$,

i.e.,

    $\sum_n G(n) \int dp\, P(p)\, P(n \mid p)\, P(k \mid p) = \int dp\, P(p)\, P(k \mid p)\, F(p)$.

We can evaluate both of the integrals for any n and k (see Sec. 4). What we then have is a set of
simultaneous equations, one for each value of k. Each of the equations is of the form "linear
combination of the G(n) equals constant". (The linear combination being over all possible n.) To
solve for the least average biased $G(\cdot)$, we must solve this set of simultaneous equations.
FIGURE CAPTIONS

Figure 1 shows the mean square error (Eqn. 10), for the Bayes (solid) and Frequency
Counts (dot-dash) estimators of the entropy $S(p) = -\sum_{i=1}^{m} p_i \log(p_i)$.

Figure 2 shows the mean sample variance (Eqn. 11), of the Bayes (solid) and
Frequency Counts (dot-dash) estimators of the entropy
$S(p) = -\sum_{i=1}^{m} p_i \log(p_i)$. Both variances are shown as functions of the sample
size N.

Figure 3 shows the sample average (Eqn. 12), for the Bayes (solid) and Frequency
Counts (dot-dash) estimators of the two bin (m = 2) entropy S(p). Both are
graphed as functions of N for various values of p = (p, 1 - p). The true value of
S(p) is also graphed (dashed).

Figure 4, like figure 3, shows the sample average (Eqn. 12), for the Bayes
(solid) and Frequency Counts (dot-dash) estimators of the two bin (m = 2) entropy
S(p). However, in figure 4 both are graphed as functions of p = (p, 1 - p) for
various values of N. The true value of S(p) is also graphed (dashed).

Figure 5 shows the sample average deviation from true (Eqn. 14), for the
Bayes (solid) and Frequency Counts (dot-dash) estimators of the two bin (m = 2)
entropy. Both are graphed as functions of p = (p, 1 - p) for various values of N.
The integral over p with the density P(p) appears as figure 1.

Figure 6 shows the posterior pdf (Eqn. 15) of the entropy S(p) for m = 2 and
fixed counts ratio $n_1 : n_2 = 1 : 15$, but differing overall $N = n_1 + n_2$. As N
increases, the density converges to a delta function at the value s = S(1/16, 15/16)
= 0.2338 of the entropy.



